Systematic coding technique for erasure correction

ABSTRACT

Disclosed herein is a method for determining how to encode data in accordance with a systematic coding technique and encoding data in accordance with the determined systematic coding technique. The method includes: determining code parameters; determining source data nodes that comprise source data that is not encoded by the systematic coding technique; for each of the redundant nodes, determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes; and determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on.

FIELD

The field of the invention is the systematic coding of data. A particularly preferred application for coded data according to embodiments is in a data storage system. The flexible code construction allows the coding to be adapted to the underlying hardware of the storage system configuration. In addition, the amount of data that is accessed and transferred in order to reconstruct unavailable data is significantly reduced from other coding techniques, such as Reed-Solomon coding.

BACKGROUND

An ever increasing amount of data is being stored in large capacity distributed data storage systems. It is normal for redundancy to be introduced into stored data. The redundancy allows data within a node to be obtained when the node is unavailable, for example due to maintenance of the node or failure of part of the data storage system. The traditional approach to redundancy is replication of the stored data, in which one or more copies of the original data are stored in other node(s). Triple replication of stored data is an accepted industry standard. However, a problem with replication is that it has a high storage overhead. There is therefore an increasing use of erasure codes to store data, as the property of being able to recover data is maintained and the storage overhead is greatly reduced. Reed-Solomon (RS) codes, for example, are a widely employed erasure coding technique.

In addition to the extent to which unavailable data can be reconstructed, there are other desirable properties in a distributed data storage system, such as a small repair bandwidth and access-optimality. Repair bandwidth is the amount of transferred data required to repair a data storage unit in a data storage system in which the data is unavailable, referred to herein as a failed node. Access-optimality is achieved when the amount of data accessed during the repair process is equal to the amount of data transferred. There are two types of repair of a failed node, functional and exact repair. Under functional repair, a failed node is replaced by a new node such that the resulting system continues to possess the data-reconstruction and repair properties. With exact repair, the new node has exactly the same content as the lost data. Exact repair is preferred over functional repair from a practical point of view.

In a (n, k, d) systematic erasure code for a data storage system, there are n nodes that store data. Of the n nodes, k of the nodes store source data, that is, data that has not been encoded by the erasure coding technique. There are r redundant nodes, where r=n−k. Each of the r redundant nodes stores data that has been coded according to the erasure coding technique. The file size, i.e. amount of source data that is stored, is B. The amount of data stored in each node is A_(N), where A_(N)=B/k. The data from a failed node is recovered by transferring data from d non-failed nodes (i.e. nodes that are available). The repair bandwidth β is greater than or equal to the lower bound that is A_(N).

A large number of erasure coding techniques exist with differing properties. The codes are designed to have one or more of the desired properties of being Maximum-Distance Separable (MDS), systematic, achieving optimal repair bandwidth, and offering access optimality.

An erasure coding technique is presented in the paper G. K. Agarwal, B. Sasidharan, and P. Vijay Kumar, ‘An alternate construction of an access-optimal regenerating code with optimal sub-packetisation level’, National Conference on Communications (NCC), pages 1-6, February 2015, referred to herein as the Agarwal paper. In the Agarwal paper, the data in each node is stored in α substripes (i.e. sub-packets). The reconstruction of unavailable data is performed by operations on substripes of data within available nodes. The sub-packetisation level represents the minimum dimension over which all operations are performed. When the sub-packetisation level is 1, as it is for standard RS codes, then each node is recovered by transferring data of size B/k from k nodes, i.e., the total amount of the transferred data is B. Thus, the repair bandwidth is equal to the file size when RS codes are used. By using a sub-packetisation level α>1, the repair bandwidth can be decreased from B. The Agarwal paper discloses the use of a sub-packetisation level of α=r^(m) where m=k/r. An essential condition to be able to construct the disclosed codes in the Agarwal paper is that m is an integer. The Agarwal paper does not disclose any technique for generating codes with any other sub-packetisation level than r^(m). In addition, the Agarwal paper does not disclose any technique for generating codes for which m is not an integer.
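The repair-bandwidth arithmetic above can be illustrated with a short sketch. The RS figure follows directly from the text (k transfers of size B/k). The comparison figure, (n−1)/r × B/k for d=n−1 helper nodes, is the cut-set bound from the regenerating-codes literature; it is an assumption brought in for illustration and is not stated in this passage. The function names are hypothetical.

```python
from fractions import Fraction


def rs_repair_bandwidth(k: int, file_size: int) -> Fraction:
    """Data transferred to repair one node with sub-packetisation level 1.

    As described in the text, k nodes each transfer B/k, so the total is B.
    """
    return k * Fraction(file_size, k)


def cut_set_repair_bound(n: int, k: int, file_size: int) -> Fraction:
    """Assumed lower bound on repair bandwidth when d = n - 1 nodes help.

    Each of the n - 1 helpers sends a 1/r share of its B/k node content.
    This formula is an assumption taken from the wider literature.
    """
    r = n - k
    return Fraction(n - 1, r) * Fraction(file_size, k)


if __name__ == "__main__":
    n, k, file_size = 14, 10, 1  # file size normalised to B = 1
    print("RS repair bandwidth:", rs_repair_bandwidth(k, file_size))
    print("Assumed lower bound:", cut_set_repair_bound(n, k, file_size))
```

With B normalised to 1, the sketch shows why a sub-packetisation level above 1 is attractive: the assumed bound is a fraction of the file size, whereas RS repair transfers the whole file size.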

A different coding technique from that in the Agarwal paper is disclosed in the paper ‘A “Hitchhiker's” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centres’ by K. V. Rashmi et al, SIGCOMM 2014, Computer Communication Review, August 2014, referred to herein as the Hitchhiker paper. The Hitchhiker paper discloses a technique for improving on RS coding by using a sub-packetisation level of exactly 2 only.

There is a need to improve known erasure coding techniques.

SUMMARY

According to a first aspect of the invention, there is provided a method for determining how to encode data in accordance with a systematic coding technique, the method comprising: determining the code parameters n, k, r and α, wherein n is the total number of nodes, k is the total number of source data nodes, r is the total number of redundant data nodes, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprise the same number of substripes, and wherein α is determined so that it satisfies the condition 1<α≤r^(m), where m=ceiling(k/r); determining source data nodes that comprise source data that is not encoded by the systematic coding technique; for each of the redundant nodes, determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes such that each of the substripes is generated in dependence on a combination of k substripes of source data and the α substripes of the redundant node are generated in dependence on all of the (α×k) substripes of source data; and determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on.
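The constraint on α in the first aspect can be checked mechanically. The following is a minimal sketch, assuming only the stated condition 1<α≤r^(m) with m=ceiling(k/r); the function names are hypothetical.

```python
def max_subpacketisation(k: int, r: int) -> int:
    """Return r**m with m = ceiling(k / r), the upper limit on alpha."""
    m = -(-k // r)  # integer ceiling division
    return r ** m


def is_valid_alpha(k: int, r: int, alpha: int) -> bool:
    """Check the stated condition 1 < alpha <= r**ceiling(k / r)."""
    return 1 < alpha <= max_subpacketisation(k, r)


if __name__ == "__main__":
    # (n, k) = (14, 10) gives r = 4 and m = ceiling(10 / 4) = 3.
    print(max_subpacketisation(10, 4))
    print([a for a in (1, 2, 16, 64, 65) if is_valid_alpha(10, 4, a)])
```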

Preferably, (k/r) is not an integer.

Preferably, r is 2 or more, and/or r is 3 or more.

Preferably, α is 2 or more, and/or α is 3 or more.

Preferably, α is determined so that it satisfies the condition 1<α<r^(m) and/or (k/r) is not an integer.

Preferably, one of the redundant nodes is determined to have each of its substripes dependent on exactly k substripes of source data and the substripes of the redundant node are generated in dependence on all of the α×k substripes of source data.

Preferably, said determining of each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed for r−1 redundant nodes.

Preferably, determining each of one or more of the substripes of a redundant node to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed for all of the substripes of the node.

Preferably, the method further comprises performing a balanced selection of the substripes that the redundant nodes are further dependent on such that substantially the same number of read operations are required to recover each source node.

Preferably, the method is computer-implemented.

Preferably, the determined systematic coding technique is MDS.

Preferably, the combining of substripes of source data to generate a substripe of a redundant node is by linear combinations over finite fields.

Preferably, the systematic coding technique is an erasure resilient systematic coding technique.

Preferably, (k/r) is an integer.

Preferably, said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed in accordance with any of a random determination, a pseudo random determination, a pre-determined technique and/or an algorithm.

Preferably, when determining each of one or more of the substripes of a redundant node to be further dependent on at least one further substripe of source data that it is not currently dependent on, the selection of each further substripe of source data is independent from the order of writing and reading the previous substripes of source data.

Preferably:

-   said step of determining source data nodes comprises determining k source data nodes {d₁, d₂, . . . , d_(k)} where each data node d_(j) comprises an indexed set of α substripes {a_(1,j), a_(2,j), . . . , a_(α,j)} as a two-dimensional array Data with α rows and k columns such that

$\text{Data} = \begin{bmatrix} a_{1,1} & a_{1,2} & \ldots & a_{1,k} \\ a_{2,1} & a_{2,2} & \ldots & a_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{\alpha,1} & a_{\alpha,2} & \ldots & a_{\alpha,k} \end{bmatrix};$

-   and the generation of the redundant nodes comprises:
-   determining r redundant data nodes {p₁, p₂, . . . , p_(r)} where each redundant node p_(l), where 1≤l≤r, comprises an indexed set of α substripes {p_(1,l), p_(2,l), . . . , p_(α,l)};
-   determining r two-dimensional index arrays P₁, . . . , P_(r);
-   determining the index array for P₁ to have α rows and k columns, where each cell in P₁ is a pair of indexes with the following values:

$P_{1} = \begin{bmatrix} (1,1) & (1,2) & \ldots & (1,k) \\ (2,1) & (2,2) & \ldots & (2,k) \\ \vdots & \vdots & \ddots & \vdots \\ (\alpha,1) & (\alpha,2) & \ldots & (\alpha,k) \end{bmatrix};$

-   determining the index arrays P₂, . . . , P_(r) to have α rows and k+m columns, and where each cell in P_(l), where 2≤l≤r, is a pair of indexes with the following values:

$P_{l} = \begin{bmatrix} (1,1) & (1,2) & \ldots & (1,k) & (?,?) & \ldots & (?,?) \\ (2,1) & (2,2) & \ldots & (2,k) & (?,?) & \ldots & (?,?) \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ (\alpha,1) & (\alpha,2) & \ldots & (\alpha,k) & (?,?) & \ldots & (?,?) \end{bmatrix},$

where the pairs with values (?,?) are further determined according to an algorithm; and

-   for each of the redundant nodes {p₁, p₂, . . . , p_(r)}, determining to generate each of the substripes p_(i,l), where 1≤i≤α and 1≤l≤r, in dependence on a combination of different source data substripes a_(j₁,j₂), where the pair (j₁,j₂) is present in the i-th row of the index array P_(l).
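The starting shape of the index arrays described above can be sketched as follows. This is an illustration only: it builds P₁ with α rows and k columns and gives P₂, . . . , P_(r) an extra m=ceiling(k/r) columns holding the placeholder pair (0, 0), standing in for the pairs that are still to be determined. The function name and the use of Python lists of tuples are assumptions.

```python
def initial_index_arrays(k: int, r: int, alpha: int):
    """Build the undetermined index arrays P_1, ..., P_r.

    P_1 has alpha rows and k columns with cell (i, j) in row i, column j.
    P_2, ..., P_r repeat those k columns and append m = ceiling(k / r)
    placeholder cells (0, 0) per row, to be filled in by a later step.
    """
    m = -(-k // r)  # integer ceiling of k / r
    base = [[(i, j) for j in range(1, k + 1)] for i in range(1, alpha + 1)]
    arrays = [base]
    for _ in range(2, r + 1):
        arrays.append([row + [(0, 0)] * m for row in base])
    return arrays


if __name__ == "__main__":
    for index, array in enumerate(initial_index_arrays(k=5, r=3, alpha=4), 1):
        print(f"P_{index}: {len(array)} rows x {len(array[0])} columns")
```

Row i of P₁ lists exactly the pairs (i, 1), . . . , (i, k), so the first redundant node combines substripes with the same row index from every source data node.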

According to a second aspect of the invention, there is provided a method for determining how to generate a systematic code, the method comprising: receiving the code parameters n, α, k and/or r; configuring an algorithm with the received code parameters; determining, by the algorithm, how to generate a systematic code in accordance with the method of the first aspect.

Preferably, n, k, α are inputs to the algorithm and index arrays P₁, . . . , P_(r) are outputs that define how to generate each of the redundant nodes, and wherein the algorithm performs the steps of:

-   initialising P₁, . . . , P_(r) as arrays P=((i,j))_(α×k);
-   appending additional m=┌k/r┐ columns to P₂, . . . , P_(r) all initialized to (0,0);
-   setting portion←┌α/r┐;
-   setting ValidPartitions←∅;
-   setting j←0;
-   repeating the steps of:
    -   setting j←j+1;
    -   setting v←┌j/r┐;
    -   setting run←┌α/r^(v)┐;
    -   setting step←┌α/r┐−run;
    -   determining D_(d_(j))=ValidPartitioning(ValidPartitions,k,r,portion,run,step,J_(v));
    -   setting ValidPartitions=ValidPartitions ∪ D_(d_(j)); and
    -   determining one D_(ρ,d_(j))∈D_(d_(j)) such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0), wherein the indexes in D_(ρ,d_(j)) are the row positions where the pairs (i,j) with indexes i∈D\D_(ρ,d_(j)) are assigned in the (k+v)-th column of P₂, . . . , P_(r);
-   until (run>1) AND (j≠0 mod r);
-   while j<k, performing the steps of:
    -   setting j←j+1;
    -   setting v←┌j/r┐;
    -   setting run←0;
    -   determining D_(d_(j))=ValidPartitioning(ValidPartitions,k,r,portion,run,step,J_(v));
    -   setting ValidPartitions=ValidPartitions ∪ D_(d_(j));
    -   determining one D_(ρ,d_(j))∈D_(d_(j)) such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0), wherein the indexes in D_(ρ,d_(j)) are the row positions where the elements (i,j) with indexes i∈D\D_(ρ,d_(j)) are assigned in the (k+v)-th column of P₂, . . . , P_(r); and
    -   when the condition j<k is no longer satisfied, outputting the determined P₁, . . . , P_(r);

    -   wherein, the steps of the algorithm further comprise:

    -   partitioning the set Nodes={d₁, . . . , d_(k)} of k data disks into ┌k/r┐ disjoint subsets

$J_{1},\ldots,J_{\lceil\frac{k}{r}\rceil}$

where |J_(v)|=r and where if r does not divide k then the last subset

$J_{\lceil\frac{k}{r}\rceil}$

has k mod r elements and where

$\text{Nodes} = \bigcup_{v = 1}^{\lceil\frac{k}{r}\rceil} J_{v};$

-   wherein the function ValidPartitioning is called by the algorithm and takes ValidPartitions, k, r, portion, run, step, J_(v), as inputs and outputs D_(d_(j))=D_(1,d_(j)), . . . , D_(r,d_(j)); and
-   the ValidPartitioning function comprises the steps of:
-   setting D={1, 2, . . . , α};
-   if run≠0 then
    -   finding D_(d_(j)) that satisfies Condition 1 and Condition 2;
-   else
    -   finding D_(d_(j)) that satisfies Condition 2;
-   where Condition 1 is that at least one subset D_(ρ,d_(j)) has portion elements with runs of run consecutive elements separated with a distance between the indexes equal to step, wherein the elements of that subset correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0), and the distance between two elements in one node is computed in a cyclical manner such that the distance between the elements a_(α−1) and a₂ is 2; and
    -   Condition 2 is that a necessary condition for the valid partitioning of the elements in the systematic nodes to achieve the lowest possible repair bandwidth is

D_(d_(j₁)) = D_(d_(j₂))

for all d_(j₁) and d_(j₂) in J_(v) and

D_(d_(j₁)) ≠ D_(d_(j₂))

for all d_(j₁) and d_(j₂) systematic nodes in the system, and if portion divides α, then D_(ρ,d_(j)) for all d_(j) in the J_(v)-th subset are disjoint, i.e., D=∪_(j=1)^(r) D_(ρ,d_(j))={1, 2, . . . , α}.
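Two of the bookkeeping steps above lend themselves to a short sketch: partitioning the k source nodes into ┌k/r┐ subsets J_(v), and computing portion, v, run and step for a given j as in the repeated steps of the algorithm. The sketch does not implement ValidPartitioning or the filling of the index arrays; the function names and the choice of consecutive node numbers for each subset are assumptions.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def partition_nodes(k: int, r: int):
    """Split node numbers 1..k into ceil(k / r) disjoint subsets J_v.

    Every subset has r elements except possibly the last, which has
    k mod r elements when r does not divide k.
    """
    nodes = list(range(1, k + 1))
    return [nodes[start:start + r] for start in range(0, k, r)]


def step_parameters(j: int, r: int, alpha: int) -> dict:
    """Compute portion, v, run and step for the j-th source node."""
    portion = ceil_div(alpha, r)
    v = ceil_div(j, r)
    run = ceil_div(alpha, r ** v)
    step = portion - run
    return {"portion": portion, "v": v, "run": run, "step": step}


if __name__ == "__main__":
    print(partition_nodes(k=10, r=4))
    for j in (1, 5, 9):
        print(j, step_parameters(j, r=4, alpha=64))
```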

Preferably, the combining of substripes of source data to generate a substripe p_(i,l) of the redundant node p_(l) is by linear combinations over finite fields according to the index arrays P₁, . . . , P_(r); wherein p_(i,l)=Σc_(l,j₁,j₂)a_(j₁,j₂), where 1≤i≤α, 1≤l≤r and the pair (j₁,j₂) exists in the i-th row of the index array P_(l) and the coefficient c_(l,j₁,j₂) is some nonzero element in the finite field.
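The linear combination for a single redundant substripe can be sketched as below. For simplicity the sketch assumes a prime field GF(p) represented by integers modulo p; the text only requires some finite field, so this choice, the dictionary-based data layout and the function name are all illustrative assumptions. The coefficient dictionary holds the values for one redundant node l.

```python
def encode_substripe(index_row, data, coeffs, p):
    """Compute one redundant substripe as a sum of c * a over a row of P_l.

    index_row: list of (j1, j2) pairs from the i-th row of an index array.
    data:      dict mapping (j1, j2) to the source substripe value a_(j1,j2).
    coeffs:    dict mapping (j1, j2) to the coefficient for this node.
    p:         prime modulus of the assumed field GF(p).
    """
    total = 0
    for pair in index_row:
        if pair == (0, 0):
            continue  # placeholder cell that was never assigned
        coefficient = coeffs[pair] % p
        if coefficient == 0:
            raise ValueError("coefficients must be nonzero field elements")
        total = (total + coefficient * data[pair]) % p
    return total


if __name__ == "__main__":
    data = {(1, 1): 3, (1, 2): 5, (2, 1): 7}
    row = [(1, 1), (1, 2), (2, 1)]  # k = 2 base pairs plus one further pair
    coeffs = {(1, 1): 2, (1, 2): 1, (2, 1): 100}
    print(encode_substripe(row, data, coeffs, p=257))
```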

According to a third aspect of the invention, there is provided a method for storing data in a data storage system, wherein n is the total number of data storage nodes of the data storage system, k is the total number of source data nodes of the data storage system, r is the total number of redundant data nodes of the data storage system, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprise the same number of substripes, and wherein α satisfies the condition 1<α≤r^(m), where m=ceiling(k/r), the method comprising: determining how to encode the source data for storing in the data storage system in dependence on the systematic coding technique of the first or second aspect; determining the redundant data by encoding the source data in accordance with the determined systematic coding technique; and storing the source data and the redundant data in the data storage nodes of the data storage system.

Preferably, the method further comprises performing a mapping operation on the source data and encoded source data such that one or more of the data storage nodes stores both source data and encoded source data.

According to a fourth aspect of the invention, there is provided a method of coding source data, the method comprising: obtaining source data; determining how to encode the source data in accordance with the systematic coding technique of the first aspect; and encoding the source data in accordance with the determined systematic coding technique.

Preferably, the method further comprises transferring the source data and redundant data over a network.

According to a fifth aspect of the invention, there is provided a coding technique for generating coded data from source data, the coding technique being equivalent to generating the coded data in dependence on the method of any of the first to fourth aspects.

According to a sixth aspect of the invention, there is provided a generator matrix for defining how to generate coded data from source data, the generator matrix defining a code that is equivalent to a code that has been generated by the method of any of the first to fifth aspects.

According to a seventh aspect of the invention, there is provided a generator matrix, G, for defining how to generate coded data from source data, wherein G has (α×n) rows and (α×k) columns with elements in a finite field in accordance with an above aspect, and G defines a code that is equivalent to a code that has been generated by the method of any of the first to sixth aspects and where G has the following form

${G = \begin{bmatrix}I \\P\end{bmatrix}},$

where I is an identity matrix with dimensions (α×k)×(α×k) and where P is a matrix with dimensions (α×r)×(α×k).
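The block structure of G can be illustrated with a toy example. The sketch below stacks an identity block on top of a parity block and multiplies the result by a source vector, again assuming a prime field GF(p) for simplicity. The parity entries used in the example are arbitrary and are not derived from the index arrays; all names are hypothetical.

```python
def systematic_generator(parity, num_source, p):
    """Stack an identity block on top of the parity block P, giving G = [I; P].

    parity:     list of rows, each with num_source entries (alpha*r rows).
    num_source: number of source substripes, alpha * k.
    p:          prime modulus of the assumed field GF(p).
    """
    identity = [[1 if row == col else 0 for col in range(num_source)]
                for row in range(num_source)]
    return identity + [[entry % p for entry in row] for row in parity]


def encode(generator, source, p):
    """Multiply G by the source vector over GF(p)."""
    return [sum(g * x for g, x in zip(row, source)) % p for row in generator]


if __name__ == "__main__":
    parity_block = [[1, 1], [1, 2]]  # arbitrary illustrative entries
    generator = systematic_generator(parity_block, num_source=2, p=257)
    print(encode(generator, [10, 20], p=257))
```

Because the top block is the identity, the first α×k outputs reproduce the source substripes unchanged, which is what makes the code systematic.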

According to an eighth aspect of the invention, there is provided a computing system configured to perform the method of any of the first to fourth aspects.

According to a ninth aspect of the invention, there is provided a computer program that, when executed by a computing system, causes the computing system to perform the method of any of the first to fourth aspects.

According to a tenth aspect of the invention, there is provided a data storage system, wherein n is the total number of data storage nodes of the data storage system, k is the total number of source data nodes of the data storage system, r is the total number of redundant data nodes of the data storage system, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprise the same number of substripes, and wherein α satisfies the condition 1<α≤r^(m), where m=ceiling(k/r), the data storage system configured to store data in accordance with the method of the third aspect.

According to an eleventh aspect of the invention, there is provided a method of recovering a node that is one of a plurality of systematically coded source and redundant nodes, the method comprising applying a decoding method that is the inverse of the method according to any of the first to fourth aspects.

According to a twelfth aspect of the invention, there is provided a decoding technique for recovering a node that is one of a plurality of systematically coded source and redundant nodes, wherein the decoding technique is the inverse of the coding technique according to the fifth aspect.

According to a thirteenth aspect of the invention, there is provided a method of recovering one of a plurality of nodes, wherein the plurality of nodes have been coded according to the method of any of the first to fourth aspects, the method comprising: obtaining a set of ┌α/r┐ substripes of data from nodes of the plurality of nodes other than the node being recovered; obtaining one or more further substripes of data from nodes of the plurality of nodes other than the node being recovered; using the obtained set of ┌α/r┐ substripes to recover one or more substripes of the node being recovered; and recovering all of the other substripes of the node being recovered in dependence on the one or more further substripes and a re-use of the obtained set of ┌α/r┐ substripes.

According to a fourteenth aspect of the invention, there is provided a method for recovering one of a plurality of nodes that have been coded according to the method of any of the first to fourth aspects, the method comprising: receiving data defining how the plurality of nodes were coded; configuring an algorithm with the received data; and determining, by the algorithm, how to recover the node in dependence on data in the other of the plurality of nodes.

Preferably, the algorithm recovers a node, d_(l), by performing the following steps:

-   accessing and transferring (k−1)┌α/r┐ elements a_(i,j) from all k−1 non-failed systematic nodes and ┌α/r┐ elements p_(i,1) from p₁ where i∈D_(ρ,d_(l));
-   repairing a_(i,l) where i∈D_(ρ,d_(l));
-   accessing and transferring (r−1)┌α/r┐ elements p_(i,j) from p₂, . . . , p_(r) where i∈D_(ρ,d_(j));
-   accessing and transferring from the systematic nodes the elements a_(i,j) indexed in the i-th row of the index arrays P₂, . . . , P_(r) where i∈D_(ρ,d_(j)) that are different from said accessed and transferred (k−1)┌α/r┐ elements a_(i,j) from all k−1 non-failed systematic nodes and ┌α/r┐ elements p_(i,1) from p₁; and
-   repairing a_(i,l) where i∈D\D_(ρ,d_(l));
-   wherein:
-   p₁, . . . , p_(r) are redundant data nodes with respective index arrays P₁, . . . , P_(r); and
-   the parameters of the code are as defined in the second aspect.
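The element counts named in the recovery steps above can be tallied as follows. The sketch counts only the three quantities stated explicitly, namely (k−1)┌α/r┐ source elements, ┌α/r┐ elements from p₁ and (r−1)┌α/r┐ elements from p₂, . . . , p_(r). The additional source elements fetched in the fourth step depend on the contents of the index arrays and are deliberately left out, so the total is a baseline rather than a complete figure. The function name is hypothetical.

```python
def baseline_repair_reads(n: int, k: int, alpha: int) -> dict:
    """Count the elements explicitly named in the recovery steps.

    Excludes the further source elements of the fourth step, whose number
    depends on how the index arrays P_2, ..., P_r were filled.
    """
    r = n - k
    portion = -(-alpha // r)  # ceiling of alpha / r
    counts = {
        "from_systematic": (k - 1) * portion,
        "from_p1": portion,
        "from_other_redundant": (r - 1) * portion,
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    # (n, k) = (14, 10) with alpha = 64 substripes per node.
    print(baseline_repair_reads(14, 10, 64))
```

For these parameters the baseline total equals (n−1)┌α/r┐ substripes, out of the α×(n−1) substripes held by the surviving nodes.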

Preferably, the nodes are nodes of a data storage system.

According to a fifteenth aspect of the invention, there is provided a method of decoding data, the method being equivalent to decoding data in dependence on the inverse of the generator matrix generated according to the sixth or seventh aspects.

According to a sixteenth aspect of the invention, there is provided a method of reading data from a data storage system, the method comprising reading data in dependence on the method of any of the eleventh to fifteenth aspects.

According to a seventeenth aspect, there is provided a method of recovering up to r failed nodes from a plurality of systematically coded source and redundant nodes, the method being equivalent to decoding data in dependence on the inverse of the generator matrix according to the sixth or seventh aspects.

Preferably, the method is computer-implemented.

According to an eighteenth aspect, there is provided a computing system configured to perform the method of any of the eleventh to seventeenth aspects.

According to a nineteenth aspect, there is provided a computer program that, when executed by a computing system, causes the computing system to perform the method of any of the eleventh to seventeenth aspects.

According to a twentieth aspect, there is provided a method for determining how to encode data in accordance with a systematic coding technique and encoding data in accordance with the determined systematic coding technique, the method comprising: determining the code parameters n, k, r and α, wherein n is the total number of nodes, k is the total number of source data nodes, r is the total number of redundant nodes, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant nodes comprise the same number of substripes, and wherein α is determined so that it satisfies either the condition 1<α≤r^(m) or both of the conditions α=r^(m) and (k/r) is not an integer, where m=ceiling(k/r); determining source data nodes that comprise source data that is not encoded by the systematic coding technique; for each of the redundant nodes, determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes such that each of the substripes is generated in dependence on a combination of k substripes of source data and the α substripes of the redundant node are generated in dependence on all of the (α×k) substripes of source data; and determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, wherein said determination comprises selecting a further substripe of source data for a redundant node to be further dependent on with the further substripe being selectable from any one of the k source nodes; and encoding data in accordance with the determined systematic coding technique.

Preferably, each of the substripes within each node is identifiable by substripe index level, i, where 1≤i≤α and i is an integer; for at least one of the redundant nodes, said determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes comprises determining, for each substripe of said at least one of the redundant nodes, the substripe of the redundant node to be a combination of a single substripe from all of the source data nodes with the substripe of the redundant node and the substripes from each of the source data nodes all having the same substripe index level; and in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting a further substripe of source data for a redundant node to be further dependent on with the further substripe being selectable from any one of the α substripe index levels; and, preferably, the determination comprises selecting substripes from the source data nodes as further substripes of source data that one or more of the redundant nodes are further dependent on with the selection comprising at least one substripe from each of the α substripe index levels.

Preferably, for each substripe index level, there is at least one substripe of a redundant node that is dependent on an additional substripe of source data with a different substripe index level from the substripe of said at least one substripe of the redundant node.

Preferably, in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises, for all but one of the redundant nodes, the redundant nodes to be determined in dependence on at least one additional substripe of source data.

Preferably, in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting at least one substripe from each of the k source data nodes as further substripes of source data that one or more of the redundant nodes are further dependent on.

Preferably, in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting at least one substripe from each of the k source data nodes as further substripes of source data that one of the redundant nodes is further dependent on.

Preferably, in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises, for each of two or more of the redundant nodes, selecting at least one substripe from each of the k source data nodes as further substripes of source data that the redundant node is further dependent on.

LIST OF FIGURES

FIG. 1 shows a coding technique according to an embodiment;

FIG. 2 shows a coding technique according to an embodiment;

FIG. 3 demonstrates the advantageous performance of coding techniques according to embodiments over known coding techniques; and

FIG. 4 is a flowchart of a method for determining how to encode data in accordance with a systematic coding technique according to an embodiment.

DESCRIPTION

Embodiments of the invention provide an advantageous systematic erasure coding technique. It is shown that a code with (n, k, d=n−1), that is constructed according to an embodiment, can have one or more of the advantageous properties of being Maximum-Distance Separable (MDS), systematic, having a flexible sub-packetisation level, providing minimum repair bandwidth, access optimality and fast decoding. A particularly advantageous property of the coding technique is the flexibility in the sub-packetisation level. That is to say, the number of substripes, i.e. sub-packets, that the data within a node is divided into can be flexibly chosen. Moreover, the sub-packetisation level can approach and/or include the lower bound as defined in the paper ‘Network coding for distributed storage systems’ by A. G. Dimakis et al, IEEE Transactions on Information Theory, September 2010.

A problem with the codes disclosed in the Agarwal paper is that it isessential for k/r to be an integer in order for the codes to beconstructed. This excludes the use of many widely used data storagesystems, such as those with (n,k)=(14, 10).

Embodiments do not experience this problem and the codes can beconstructed both when k/r is an integer and when k/r is not an integer.According to embodiments, the sub-packetisation level, α, can be equalor lower than r^(┌m┐) where m=k/r and ┌ ┐ is the ceiling function. Theceiling function is equivalently represented as ceiling( ). The ceilingfunction rounds a non-integer value up to the next integer.
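As a brief illustration of the limit stated above, the following sketch evaluates r^(┌k/r┐) for the two code configurations discussed in this description. The function name max_subpacketisation is an assumption made for illustration only and does not appear in the disclosure.

```python
def max_subpacketisation(n: int, k: int) -> int:
    """Largest sub-packetisation level r^ceil(k/r) for an (n, k) code."""
    r = n - k                 # number of redundant nodes
    m_ceil = -(-k // r)       # integer ceiling of k/r
    return r ** m_ceil


print(max_subpacketisation(14, 10))  # r=4, ceil(10/4)=3
print(max_subpacketisation(9, 6))    # r=3, ceil(6/3)=2
```

Any α up to the returned value may then be chosen, including when k/r is not an integer as in the (14, 10) case.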

Advantageously, the coding technique can be flexibly applied in datastorage systems, such as those with (n,k)=(14, 10).

The construction of a systematic erasure code according to embodimentsis described in detail below.

In a data storage system, it is quite rare for there to be a failure ofa node and very rare for multiple node failures to occur simultaneously.During maintenance of the data storage system, disruption is usuallyminimised by only one node being offline at any given time. Accordingly,by far the most frequent requirement in a data storage system is therecovery of only one node.

The code according to embodiments is a systematic (n,k) code. That is to say, there are n nodes that store data. Of the n nodes, k of the nodes store source data, that is, data that has not been encoded by the erasure coding technique. There are r redundant nodes, where r=n−k. The performance of the code is presented with (n,k,d=n−1). The performance of the code is therefore demonstrated in the situation with only one node being recovered as this is by far the most common practical situation experienced. However, the code can recover up to r simultaneous failures.

The codes according to embodiments define redundant nodes as each comprising a linear combination of all of the source data packets. At least one of the redundant nodes is defined so that its substripes are defined as a linear combination of all of the substripes of source data but with each substripe of source data used only once in the construction of the substripes of the redundant node. This is the left hand redundant node in FIGS. 1 and 2.

For the other redundant nodes, one or more of their substripes are further defined as being dependent on additional substripes of the source data. There is the condition that the substripes introduced into the construction of a substripe of the redundant node are not substripes of source data that that particular substripe of the redundant node is already dependent on. The introduced substripes, and the substripes of the redundant nodes into which they are added, may otherwise be selected randomly, pseudo-randomly, or according to a predetermined technique. If the introduced substripes are selected randomly, or pseudo-randomly, then preferably the determined code is tested to confirm that it still has the MDS property. The introduction of substripes may then be repeatedly changed and tested until a code with the desired properties is obtained.

Such a definition of redundant nodes can be seen in FIGS. 1 and 2. Advantageously, the introduction of the additional substripes into the definition of redundant nodes greatly reduces the amount of data that needs to be read in order to reconstruct an unavailable node. After data has been read to reconstruct a first substripe of the node, the re-construction of the other substripes can be performed with a substantial re-use of the already read data.

A technique for constructing codes according to embodiments is set out below.

Consider a file of size B=kA_(N) symbols from a finite field F_(q) stored in k systematic nodes d_(j) of data capacity A_(N). The capacity of a node can also be expressed by the sub-packetisation level, α, that represents the number of substripes, i.e. sub-packets, or symbols, of data stored by the node. The substripes are the smallest blocks of data transferred in operations to both encode and recover (i.e. decode) a node.

We define a code according to embodiments in the following way:

The data from the k source data nodes {d₁, d₂, . . . , d_(k)}, where each data node d_(j) comprises an indexed set of α substripes {a_(1,j), a_(2,j), . . . , a_(α,j)}, is presented as a two-dimensional array Data with α rows and k columns

${Data} = {\begin{bmatrix}a_{1,1} & a_{1,2} & \ldots & a_{1,k} \\a_{2,1} & a_{2,2} & \ldots & a_{2,k} \\\vdots & \vdots & \ddots & \vdots \\a_{\alpha,1} & a_{\alpha,2} & \ldots & a_{\alpha,k}\end{bmatrix}.}$

Let P=(i,j) be an index array of size α×k where α≤r^(┌m┐) and m=k/r. We define r two-dimensional index arrays P₁, . . . , P_(r) for the r redundant data nodes {p₁, p₂, . . . , p_(r)} where each redundant node p_(l), where 1≤l≤r, comprises an indexed set of α substripes {p_(1,l), p_(2,l), . . . , p_(α,l)}. The symbols p_(i,l) in the parity nodes, where 1≤i≤α and 1≤l≤r, are generated as a combination of the elements from the source data nodes a_(j₁,j₂), where the pair (j₁,j₂) is present in the i-th row of the index array P_(l). We refer to the elements of the nodes as elements or symbols and we use the terms interchangeably. The elements of each of the α rows are linearly combined with coefficients from the finite field F_(q). These linear relations are determined according to known techniques so that they provide an MDS code, i.e., to have the property that the entire source data can be recovered from any k nodes (systematic or parity).

The index array for P₁ has α rows and k columns, and each cell in P₁ is a pair of indexes with the following values:

$P_{1} = {\begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \ldots & \left( {1,k} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \ldots & \left( {2,k} \right) \\\vdots & \vdots & \ddots & \vdots \\\left( {\alpha,1} \right) & \left( {\alpha,2} \right) & \ldots & \left( {\alpha,k} \right)\end{bmatrix}.}$

The index arrays P₂, . . . , P_(r) have α rows and k+┌m┐ columns, and each cell in P_(l), where 2≤l≤r, is a pair of indexes with the following values:

$P_{l} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \ldots & \left( {1,k} \right) & \left( {?,?} \right) & \ldots & \left( {?,?} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \ldots & \left( {2,k} \right) & \left( {?,?} \right) & \ldots & \left( {?,?} \right) \\\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\\left( {\alpha,1} \right) & \left( {\alpha,2} \right) & \ldots & \left( {\alpha,k} \right) & \left( {?,?} \right) & \ldots & \left( {?,?} \right)\end{bmatrix},$

where the pairs with values (?,?) are further determined with Algorithm 1. We use the following terms and variables in Algorithm 1:

Algorithm 1 Algorithm to generate the index arrays

Input: n, k, α

Output: Index arrays P₁, . . . , P_(r).

-   -   1. Initialization: P₁, . . . , P_(r) are initialized as index arrays P=(i,j)_(α×k);
-   -   2. Append additional ┌k/r┐ columns to P₂, . . . , P_(r) all initialized to (0, 0);
-   -   3. Set portion←┌α/r┐;
-   -   4. Set ValidPartitions←∅;
-   -   5. Set j←0;
-   -   6. # Phase 1
-   -   7. repeat
-   -   8. Set j←j+1;
-   -   9. Set v←┌j/r┐;
-   -   10. Set run←┌α/r^(v)┐;
-   -   11. Set step←┌α/r┐−run;
-   -   12. D_(d_j)=ValidPartitioning(ValidPartitions, k, r, portion, run, step, J_(v));
-   -   13. Set ValidPartitions=ValidPartitions ∪ D_(d_j);
-   -   14. Determine one D_(ρ,d_j)∈D_(d_j) such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0);
-   -   15. The indexes in D_(ρ,d_j) are the row positions where the pairs (i,j) with indexes i∈D\D_(ρ,d_j) are assigned in the (k+v)-th column of P₂, . . . , P_(r);
-   -   16. until (run>1) AND (j≠0 mod r)
-   -   17. # Phase 2
-   -   18. while j<k do
-   -   19. Set j←j+1;
-   -   20. Set v←┌j/r┐;
-   -   21. Set run←0;
-   -   22. D_(d_j)=ValidPartitioning(ValidPartitions, k, r, portion, run, step, J_(v));
-   -   23. Set ValidPartitions=ValidPartitions ∪ D_(d_j);
-   -   24. Determine one D_(ρ,d_j)∈D_(d_j) such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0);
-   -   25. The indexes in D_(ρ,d_j) are the row positions where the pairs (i,j) with indexes i∈D\D_(ρ,d_j) are assigned in the (k+v)-th column of P₂, . . . , P_(r);
-   -   26. end while
-   -   27. Return P₁, . . . , P_(r).

Algorithm 1 further calls the function ValidPartitioning, where ValidPartitions, k, r, portion, run, step, J_(v) are inputs to the function and D_(d_j)={D_(1,d_j), . . . , D_(r,d_j)} is an output;

1. Set D={1, 2, . . . , α};

2. If run≠0 then

3. Find D_(d_j) that satisfies Condition 1 and Condition 2;

4. Else

5. Find D_(d_j) that satisfies Condition 2;

6. End if

7. Return D_(d_j);

where P₁, . . . , P_(r) and α are global variables;

The above algorithm receives as inputs n, k, α. The input n is typically the largest number of available nodes for storing source data and redundant data in a data storage system. The input k, which is the number of nodes storing source data, is a matter of design choice. Decreasing the value of k increases the number of nodes that can be recovered but also increases the storage overhead. The input α is a matter of design choice. It was shown in the Agarwal paper that, under specific conditions, if α had certain specific values determined by n and k then improvements were made over RS codes in which α=1.

Embodiments advantageously allow α to be flexibly chosen. The choice of α may be motivated by a value that is particularly appropriate for the underlying hardware of the data storage system and/or achieving performance gains over RS coding.

The outputs of the algorithm are index arrays that determine the way of combining substripes of source data to form all of the redundant nodes, also referred to as parity nodes, of a data storage system.

The terms and variables in Algorithm 1 and the function ValidPartitioning are described further below:

-   -   Partitioning the set Nodes={d₁, . . . , d_(k)} of k nodes into ┌k/r┐ disjoint subsets J₁, . . . , J_(┌k/r┐), where |J_(v)|=r (if r does not divide k then the last subset J_(┌k/r┐) has k mod r elements) and Nodes=∪_(v=1)^(┌k/r┐) J_(v).

Embodiments include this partitioning being any selection of k nodes, including random selections. Without loss of generality we use the natural ordering as follows:

${J_{1} = \left\{ {d_{1},\ldots \mspace{14mu},d_{r}} \right\}},{J_{2} = \left\{ {d_{r + 1},\ldots \mspace{14mu},d_{2r}} \right\}},\ldots \mspace{14mu},{J_{\lceil\frac{k}{r}\rceil} = {\left\{ {d_{{{\lfloor\frac{k}{r}\rfloor} \times r} + 1},\ldots \mspace{14mu},d_{k}} \right\}.}}$

-   -   Each node d_(j) comprises an indexed set of α symbols {a_(1,j), a_(2,j), . . . , a_(α,j)};
-   -   portion=┌α/r┐; the set of all symbols in d_(j) is partitioned into disjoint subsets where at least one subset has portion number of elements.
-   -   The algorithm has two Phases. Phase 1 ends when the value of run becomes 1 and the indexes of all nodes from a specific J_(v) have been scheduled. In Phase 2, the indexes from the remaining nodes are scheduled in the index arrays.
-   -   run=┌α/r^(v)┐, for values of v∈{1, . . . , ┌k/r┐};
-   -   step=┌α/r┐−run;
-   -   For the subsequent (k+v)-th column, where v∈{1, . . . , ┌k/r┐}, the scheduling of the indexes corresponding to the nodes in J_(v) is done in subsets of indexes from a valid partitioning.

-   -   A valid partitioning D_(d_j)={D_(1,d_j), . . . , D_(r,d_j)} of a set of indexes D={1, 2, . . . , α}, where the i-th symbol in d_(j) is indexed by i in D, is a partitioning into r disjoint subsets D_(d_j)=∪_(ρ=1)^(r) D_(ρ,d_j). If r divides α, then the valid partitioning for all nodes in J_(v) is equal. If r does not divide α, then the valid partitioning has to contain at least one subset D_(ρ,d_j) with portion pairs that correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs.
-   -   Condition 1: At least one subset D_(ρ,d_j) has portion pairs with runs of run consecutive elements separated with a distance between the indexes equal to step. The elements of that subset correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0). The distance between two elements in one node is computed in a cyclical manner, i.e., the distance between the elements a_(α−1) and a₂ is 2;
-   -   Condition 2: A necessary condition for the valid partitioning of the index pairs in the systematic nodes to achieve the lowest possible repair bandwidth is D_(d_j₁)=D_(d_j₂) for all d_(j₁) and d_(j₂) in J_(v) and D_(d_j₁)≠D_(d_j₂) for all d_(j₁) and d_(j₂) systematic nodes in the system. If portion divides α, then D_(ρ,d_j) for all d_(h) in the J_(v)-th subset are disjoint, i.e., D=∪_(j=1)^(r) D_(ρ,d_j)={1, 2, . . . , α}.
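The natural-ordering partitioning of the systematic nodes into the subsets J_(v), and the value of portion, described in the list above can be sketched as follows. The function names partition_nodes and portion_size are assumptions made for illustration; nodes are represented by their 1-based indexes.

```python
def partition_nodes(k, r):
    """Split node indexes 1..k into ceil(k/r) consecutive groups J_1, J_2, ..."""
    return [list(range(start, min(start + r, k + 1)))
            for start in range(1, k + 1, r)]


def portion_size(alpha, r):
    """portion = ceil(alpha / r)."""
    return -(-alpha // r)


print(partition_nodes(10, 4))  # last group holds the k mod r leftover nodes
print(portion_size(7, 3))
```

When r does not divide k, as for (n,k)=(14, 10), the last group is shorter than the others.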

The corresponding algorithm for the repair, i.e. recovery, of a single systematic node d_(l) is given in Algorithm 2. A set of ┌α/r┐ symbols is accessed and transferred from each of the n−1 non-failed nodes. If α≠r^(┌m┐), where m=k/r, then additional elements may be required as described in Step 4 of Algorithm 2. Note that when reading data in from a data storage system, specific elements of the data are transferred from their nodes just once and then stored in a buffer. For every subsequent use of that element the element is read from the buffer and a further transfer operation to obtain the element from the data storage system is not required. Advantageously, the amount of read data is less than, for example, that required for RS coding.

Algorithm 2 Repair of a systematic node d_(l)

-   -   1. Access and transfer (k−1)┌α/r┐ elements a_(i,j) from all k−1 non-failed systematic nodes and ┌α/r┐ elements p_(i,1) from p₁ where i∈D_(ρ,d_l);
-   -   2. Repair a_(i,l) where i∈D_(ρ,d_l);
-   -   3. Access and transfer (r−1)┌α/r┐ elements p_(i,j) from p₂, . . . , p_(r) where i∈D_(ρ,d_l);
-   -   4. Access and transfer from the systematic nodes the elements a_(i,j) with indexes listed in the i-th row of the index arrays P₂, . . . , P_(r) where i∈D_(ρ,d_l) that have not been read in step 1;
-   -   5. Repair a_(i,l) where i∈D\D_(ρ,d_l);

Proposition 1: The repair bandwidth β to repair a single systematic node is bounded between the following values (lower and upper bound of the repair traffic):

$\frac{\left( {n - 1} \right)}{r} \leq \beta \leq {\frac{\left( {n - 1} \right)}{r} + {\frac{\left( {r - 1} \right)}{\alpha}\left\lceil \frac{\alpha}{r} \right\rceil \left\lceil \frac{k}{r} \right\rceil}}$
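The bounds of Proposition 1 can be evaluated numerically as in the sketch below. It is assumed here that β is expressed in units of one node's capacity (α symbols), which is consistent with the lowest average repair bandwidth of 3.25 reported for the (14, 10) code; the function name repair_bandwidth_bounds is illustrative only.

```python
from fractions import Fraction


def repair_bandwidth_bounds(n, k, alpha):
    """Lower and upper bounds of Proposition 1 as exact fractions."""
    r = n - k
    ceil_div = lambda a, b: -(-a // b)  # integer ceiling of a/b
    lower = Fraction(n - 1, r)
    upper = lower + Fraction(r - 1, alpha) * ceil_div(alpha, r) * ceil_div(k, r)
    return lower, upper


for params in [(9, 6, 7), (14, 10, 64)]:
    lo, hi = repair_bandwidth_bounds(*params)
    print(params, float(lo), float(hi))
```

Under the same assumption, the 22 symbols read in the (9, 6) example with α=7 correspond to 22/7, which lies between the two bounds for that code.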

The optimality of the proposed Algorithm is captured in the following Proposition.

Proposition 2: The indexes (i, l) of the elements a_(i,l) where i∈D\D_(ρ,d_l) for each group of r systematic nodes are scheduled in one of the ┌k/r┐ additional columns in the index arrays P₂, . . . , P_(r).

Next we show that there always exists a set of non-zero coefficients from F_(q) in the linear combinations so that the code is MDS. We adopt Theorem 4.1 in the Agarwal paper:

Theorem 1: There exists a choice of non-zero coefficients c_(l,i,j) where l=1, . . . , r, i=1, . . . , α and j=1, . . . , k from F_(q) such that the code is MDS if

$q \geq {\begin{pmatrix}n \\k\end{pmatrix}r\mspace{14mu} {\alpha.}}$
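For a sense of scale, the sufficient field size of Theorem 1 can be computed for the example codes discussed in this description. The function name min_field_size is an assumption; the returned value is the bound itself, and a practical field would use the next prime power at or above it.

```python
from math import comb


def min_field_size(n, k, alpha):
    """Value of C(n, k) * r * alpha from Theorem 1."""
    r = n - k
    return comb(n, k) * r * alpha


print(min_field_size(9, 6, 7))    # (9, 6) code with alpha = 7
print(min_field_size(14, 10, 8))  # (14, 10) code with alpha = 8
```

The bound is a sufficient condition only, so smaller fields are not ruled out by it.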

Examples of the generation of a code according to the Algorithms presented herein are provided below. Embodiments are first demonstrated with a (9,6,8) code that has the unusual sub-packetisation level of 7. Embodiments are then demonstrated with a (14, 10, 13) code. This is a practical code that is used, for example, in the data storage system of Facebook™.

The first embodiment constructs a (n,k,d)=(9,6,8) code. The optimal sub-packetisation level to achieve the lower bound of the repair traffic given in the Agarwal paper for this code is 3^(┌6/3┐)=9 and this is the only sub-packetisation level possible with the codes in the Agarwal paper. The present embodiment demonstrates the use of a flexible sub-packetisation level with α=7.

The algorithms disclosed herein allow any sub-packetisation level to be used that is less than or equal to r^(┌k/r┐).

FIG. 1 shows the structure of the systematic code generated according to the embodiment. The nodes d₁, . . . , d₆ are the systematic nodes, also referred to as source data nodes, and store source data. The nodes p₁, p₂ and p₃ are the redundant nodes, also referred to as parity nodes. The file size B is 42 symbols. The elements of p₁ are linear combinations of the row elements from the systematic nodes multiplied by coefficients from a finite field, while the parity nodes p₂ and p₃ have further elements besides the row sum. The embodiment schedules the indexes of the elements of the systematic nodes that do not belong to D_(ρ,d_j) at portion=3 positions on the (6+v)-th column, v=1, 2, of P₂ and P₃.

The following steps are performed to determine the further elements in the parity nodes p₂ and p₃ besides the row sum.

-   -   1. Initialize P₁, P₂, P₃ as arrays P=(i,j)_(7×6).
-   -   2. Append additional ┌6/3┐=2 columns to P₂ and P₃ initialized to (0, 0).
-   -   3. We use the notation D_(ρ,d_j), j=1, . . . , 6, to denote the subset with its elements corresponding to row indexes in the (6+v)-th column, v=1, 2, in one of the arrays P₂ and P₃, that are all zero values (0, 0).
-   -   4. Set portion equal to 3 and ValidPartitions to an empty set.
-   -   5. # Phase 1: For the systematic nodes d₁, d₂ and d₃ in J₁, run is equal to 3 and step is equal to 0.
-   -   6. We call the function ValidPartitioning( ) with appropriate parameters for this phase. In this phase both Condition 1 and Condition 2 have to be fulfilled. For the node d₁ as the output from the function ValidPartitioning( ) we get D_(d₁)={{1,2,3}, {4,5,6}, {7}}. Further on with steps 14 and 15 of Algorithm 1 we determine D_(ρ,d₁)={1, 2, 3}. This is due to the fact that the first 3 zero elements in the 7-th column of P₂ are at the positions (i, 7) where i=1, 2, 3. Note that the run length is 3 and the distance between the indexes is 0. The i indexes of the remaining pairs (i, 1) where i=4, . . . , 7 belong to 2 other subsets D\D_(ρ,d₁), i.e., D_(2,d₁)={4, 5, 6} and D_(3,d₁)={7}. The pairs (i, 1) for i∈D\D_(ρ,d₁) are added in the 7-th column of P₂, P₃. Similarly, for the node d₂ as the output from the function ValidPartitioning( ) we get D_(d₂)={{1,2,3}, {4,5,6}, {7}} and after steps 14 and 15 in Algorithm 1 we get D_(ρ,d₂)={4, 5, 6}. For the node d₃ we get D_(d₃)={{1,2,3}, {5,6,7}, {4}} and D_(ρ,d₃)={5, 6, 7}.
-   -   7. At the end of this phase ValidPartitions={{1,2,3}, {4,5,6}, {7}, {5,6,7}, {4}}.
-   -   8. At the end of this phase a snapshot of the three index arrays P₁, P₂, P₃ is as follows:

$\mspace{20mu} {P_{1} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) \\\left( {6,1} \right) & \left( {6,2} \right) & \left( {6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right)\end{bmatrix}}$ $P_{2} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) & \left( {4,1} \right) & \left( {0,0} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) & \left( {5,1} \right) & \left( {0,0} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) & \left( {6,1} \right) & \left( {0,0} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) & \left( {1,2} \right) & \left( {0,0} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) & \left( {2,2} \right) & \left( {0,0} \right) \\\left( {6,1} \right) & \left( {6,2} 
\right) & \left( {6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) & \left( {3,2} \right) & \left( {0,0} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right) & \left( {4,3} \right) & \left( {0,0} \right)\end{bmatrix}$ $P_{3} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) & \left( {7,1} \right) & \left( {0,0} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) & \left( {0,0} \right) & \left( {0,0} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) & \left( {0,0} \right) & \left( {0,0} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) & \left( {7,2} \right) & \left( {0,0} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) & \left( {1,3} \right) & \left( {0,0} \right) \\\left( {6,1} \right) & \left( {6,2} \right) & \left( {6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) & \left( {2,3} \right) & \left( {0,0} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right) & \left( {3,3} \right) & \left( {0,0} \right)\end{bmatrix}$

-   -   9. # Phase 2:
-   -   10. For the systematic nodes d₄, d₅ and d₆ in J₂ set run equal to 0. Now, when we call the function ValidPartitioning( ) with appropriate parameters for this phase, only Condition 2 has to be fulfilled. For the node d₄ as the output from the function ValidPartitioning( ) we get D_(d₄)={{1,4,7}, {2,5}, {3,6}} and after steps 24 and 25 in Algorithm 1 we get D_(ρ,d₄)={1, 4, 7}. Similar steps and outputs are obtained for d₅ and d₆ and they are: D_(ρ,d₅)={2, 5, 1} and D_(ρ,d₆)={3, 6, 2}.
-   -   11. At the end of this phase ValidPartitions={{1,2,3}, {4,5,6}, {7}, {5,6,7}, {4}, {1,4,7}, {2,5}, {3,6}, {2,5,1}, {3,6,2}}.
-   -   12. At the end of this phase a snapshot of the three index arrays P₁, P₂, P₃ is as follows:

$P_{1} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) \\\left( {6,1} \right) & \left( {6,2} \right) & \left( {6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right)\end{bmatrix}$ $P_{2} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) & \left( {4,1} \right) & \left( {2,4} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) & \left( {5,1} \right) & \left( {3,5} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) & \left( {6,1} \right) & \left( {1,6} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) & \left( {1,2} \right) & \left( {5,4} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) & \left( {2,2} \right) & \left( {6,5} \right) \\\left( {6,1} \right) & \left( {6,2} \right) & \left( 
{6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) & \left( {3,2} \right) & \left( {5,6} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right) & \left( {4,3} \right) & \left( {0,0} \right)\end{bmatrix}$ $P_{3} = \begin{bmatrix}\left( {1,1} \right) & \left( {1,2} \right) & \left( {1,3} \right) & \left( {1,4} \right) & \left( {1,5} \right) & \left( {1,6} \right) & \left( {7,1} \right) & \left( {3,4} \right) \\\left( {2,1} \right) & \left( {2,2} \right) & \left( {2,3} \right) & \left( {2,4} \right) & \left( {2,5} \right) & \left( {2,6} \right) & \left( {0,0} \right) & \left( {4,5} \right) \\\left( {3,1} \right) & \left( {3,2} \right) & \left( {3,3} \right) & \left( {3,4} \right) & \left( {3,5} \right) & \left( {3,6} \right) & \left( {0,0} \right) & \left( {4,6} \right) \\\left( {4,1} \right) & \left( {4,2} \right) & \left( {4,3} \right) & \left( {4,4} \right) & \left( {4,5} \right) & \left( {4,6} \right) & \left( {7,2} \right) & \left( {6,4} \right) \\\left( {5,1} \right) & \left( {5,2} \right) & \left( {5,3} \right) & \left( {5,4} \right) & \left( {5,5} \right) & \left( {5,6} \right) & \left( {1,3} \right) & \left( {7,5} \right) \\\left( {6,1} \right) & \left( {6,2} \right) & \left( {6,3} \right) & \left( {6,4} \right) & \left( {6,5} \right) & \left( {6,6} \right) & \left( {2,3} \right) & \left( {7,6} \right) \\\left( {7,1} \right) & \left( {7,2} \right) & \left( {7,3} \right) & \left( {7,4} \right) & \left( {7,5} \right) & \left( {7,6} \right) & \left( {3,3} \right) & \left( {0,0} \right)\end{bmatrix}$

The above demonstrates how the indexes of the substripes used to generate each of the redundant nodes of a systematic erasure code are determined according to an embodiment.

Next we demonstrate how the systematic node d₁ is recovered, according to an embodiment, in the event of this node being unavailable. First, we repair the elements a_(i,1), i∈D_(ρ,d₁), i.e., a_(i,1) where i=1, 2, 3.

We therefore access and transfer 3 elements a_(i,j), where i=1, 2, 3, from each of the 5 non-failed systematic nodes d_(j), j=2, . . . , 6 (15 elements in total), and 3 elements p_(i,1) where i=1, 2, 3 from p₁. In order to recover the rest of the elements a_(i,1), i∈D\D_(ρ,d₁), we need to access and transfer 3 symbols p_(i,2) where i=1, 2, 3 from p₂ and 1 symbol p_(1,3) from p₃. This demonstrates that the data from d₁ can be recovered by accessing and transferring in total 15+3+3+1=22 elements from the 8 non-failed nodes. The same amount of data, 22 symbols, is needed to recover any of the other systematic nodes in the event of the failure of a single node. Accordingly, codes generated according to Algorithm 1 are balanced in that the same amount of data needs to be read in order to recover any one of the nodes.
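The count of 22 symbols can be checked with the following sketch, which transcribes the two appended columns of P₂ and P₃ from the final snapshot above together with the subsets D_(ρ,d_j) from the worked example. The rule used to pick which rows of p₂ and p₃ to read (a row in D_(ρ,d_l) is read when one of its appended cells holds a still-missing element of the failed node) is an interpretation of Algorithm 2 made for this illustration, and the names EXTRA, D_RHO and repair_reads are assumptions.

```python
K = 6  # number of systematic nodes in the (9, 6) example

# Appended columns 7 and 8 of P2 and P3, one list per row; (0, 0) is unused.
EXTRA = {
    2: [[(4, 1), (2, 4)], [(5, 1), (3, 5)], [(6, 1), (1, 6)], [(1, 2), (5, 4)],
        [(2, 2), (6, 5)], [(3, 2), (5, 6)], [(4, 3), (0, 0)]],
    3: [[(7, 1), (3, 4)], [(0, 0), (4, 5)], [(0, 0), (4, 6)], [(7, 2), (6, 4)],
        [(1, 3), (7, 5)], [(2, 3), (7, 6)], [(3, 3), (0, 0)]],
}
# Subsets D_rho for d1..d6 as given in the worked example.
D_RHO = {1: {1, 2, 3}, 2: {4, 5, 6}, 3: {5, 6, 7},
         4: {1, 4, 7}, 5: {1, 2, 5}, 6: {2, 3, 6}}


def repair_reads(failed):
    """Number of distinct symbols transferred to repair systematic node `failed`."""
    rows = D_RHO[failed]
    # Step 1: rows in D_rho from the surviving systematic nodes and from p1.
    data = {(i, j) for i in rows for j in range(1, K + 1) if j != failed}
    parity = {(i, 1) for i in rows}
    for l, extra in EXTRA.items():
        for i in rows:
            cells = extra[i - 1]
            # Step 3: read p_(i,l) only if it carries a missing element of `failed`.
            if any(j == failed and x not in rows for x, j in cells):
                parity.add((i, l))
                # Step 4: other systematic elements of that row not yet read.
                data.update((x, j) for x, j in cells if j not in (0, failed))
    return len(data) + len(parity)


print([repair_reads(l) for l in range(1, 7)])
```

Because symbols are held in sets, an element that is needed more than once is counted a single time, mirroring the buffering described for Algorithm 2.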

In preferred embodiments, one of the redundant nodes is generated in dependence on all of the substripes of the source data nodes with all of the substripes of the redundant node comprising different substripes of the source data nodes such that no two substripes of the redundant node are generated in dependence on the same substripe of source data. The coding advantages of embodiments are realised by the generation of one or more further redundant nodes that are also generated in dependence on all of the substripes of the source data nodes but with one or more substripes within each node being generated in further dependence on source data packets such that the same substripe(s) of source data are used in the generation of different substripes of the same redundant node. This technique allows the re-use of already read data when reconstructing a node and therefore less data needs to be read than in RS coding.

FIG. 2 shows the structure of the systematic (14, 10, 13) code generated according to a second embodiment. Note that this code cannot be generated with the method presented in the Agarwal paper. The optimal sub-packetisation level to achieve the lower bound of the repair traffic given in the paper ‘Network coding for distributed’ by A. G. Dimakis et al, IEEE Transactions on Information Theory, September 2010 is 4^(┌10/4┐)=64 but constructing a (14, 10, 13) code with this sub-packetisation level is not possible with the codes disclosed in the Agarwal paper.

We demonstrate the advantage of embodiments of being able to construct an access-optimal code for any sub-packetisation level, thus we set α=8.

The nodes d₁, . . . , d₁₀ are the systematic nodes, also referred to as source data nodes, and store source data. The nodes p₁, p₂, p₃ and p₄ are the redundant nodes, also referred to as parity nodes. The file size B is 80 symbols. The elements of p₁ are linear combinations of the row elements from the systematic nodes multiplied by coefficients from a finite field, while the parity nodes p₂, p₃ and p₄ have extra elements besides the row sum.

The above described Algorithm 1 is used to determine the additional substripes of source data that all but one of the parity nodes are to be generated in dependence on. The embodiment schedules the elements a_(i,j) from a specific d_(j) where i∈D\D_(ρ,d_j) at portion positions in the i-th row, i∈D_(ρ,d_j), and the (10+v)-th column, v=1, 2, 3, of p₂, p₃ and p₄. The determined combinations of substripes for each of the parity nodes are shown in FIG. 2.

To demonstrate the improved performance of embodiments, we compare the performance for a (14, 10) code under RS coding, the codes in the Hitchhiker paper and codes generated according to embodiments.

FIG. 3 depicts the relationship between the average repair bandwidth and the sub-packetisation level. Average repair bandwidth is defined as the ratio of the total repair bandwidth to repair all systematic nodes to the file size B. The average repair bandwidth is equal to the file size for an RS code and this corresponds to the value of the repair bandwidth when the sub-packetisation level is 1. A Hitchhiker code with a sub-packetisation level equal to 2 reduces the repair bandwidth by 35% from that of the RS code. The rest of the values for the average repair bandwidth are for different sub-packetisation levels and determined for the codes according to embodiments. The lowest average repair bandwidth is 3.25 and is reached for α=r^(m)=64, where m=┌k/r┐. The codes according to embodiments reduce the repair bandwidth for any systematic node by 67.5% from that of the RS code.
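The quoted figures can be checked with a short calculation. The sketch below assumes that bandwidth is measured in units of one node's size, that RS repair reads k whole nodes, and that the lower bound for repair from the n−1 surviving nodes is (n−1)/r node sizes; these assumptions are consistent with the values 3.25 and 67.5% given above.

```python
from fractions import Fraction

n, k = 14, 10
r = n - k

rs_bandwidth = Fraction(k)              # RS repair: read k complete nodes
lower_bound = Fraction(n - 1, r)        # (n - 1) / r node sizes
reduction = 1 - lower_bound / rs_bandwidth

print(float(lower_bound))               # 3.25
print(float(reduction))                 # 0.675
```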

Embodiments therefore provide a general construction of access-optimal regenerating codes for any systematic node when the sub-packetisation level α is less than or equal to r^(m), where m=┌k/r┐ and k/r is not restricted to being an integer. The wide range for sub-packetisation levels together with the two Phases in Algorithm 1 in the code construction lead to a high flexibility of constructing codes for different code rates and for different bottlenecks caused by underlying hardware and system configurations. The lower bound of the repair bandwidth is achieved for α=r^(┌k/r┐), while the repair bandwidth is close to the lower bound when α<r^(┌k/r┐). The repair process of a failed systematic node is linear and highly parallelized. A set of ┌α/r┐ symbols is independently repaired first and used along with the accessed data from other nodes to repair the remaining symbols.

Codes according to embodiments include codes in which k/r is an integer. For such codes, α may be less than r^(m), where m=k/r. Embodiments do also include codes in which k/r is an integer and α is equal to r^(m), where m=k/r. There are advantages of codes according to these embodiments over the codes disclosed in the Agarwal paper. For example, the codes according to embodiments have improved properties such as providing access optimality, fast decoding, consecutiveness of stripes and others. In addition, as described in more detail later, the techniques and algorithm disclosed herein for constructing codes according to embodiments can be used to construct a large number of codes and the codes with the most appropriate properties can be selected for a particular application.

Accordingly, the embodiment provides an advantageous structure of code. Embodiments are not limited to the construction of codes according to the specific algorithms disclosed herein and embodiments extend to any systematic erasure coding technique that determines how to generate substripes of redundant nodes in a way that improves over RS coding whilst providing advantageous flexible code construction according to embodiments.

As shown in FIGS. 1 and 2, the first of the redundant nodes is generated as a linear combination of substripes within the source data nodes. The other redundant nodes are based on a similar combination of source nodes as the first redundant node, though the linear coefficients of each node may differ. The other redundant nodes may be based on the same combination of substripes as for the first redundant node and only differ due to the additional substripes that are introduced into the combination. The algorithms presented herein determine how to generate such codes and to achieve significant advantages over RS coding. However, in practice, the addition of any additional substripes into the combination improves on RS coding so long as the introduced substripe is not a substripe that the substripe of the redundant node is already determined to be dependent on. The operation at the substripe level allows the re-use of already obtained data when recovering an unavailable node and thereby reduces the amount of data that needs to be read relative to RS coding.

In particular, when determining which additional substripes are included in a combination for generating a substripe of a redundant node, there is a lot of flexibility on which substripes can be included. For example, the additional substripes that are determined for inclusion in a combination are not restricted by the condition of being dependent on message symbols from previous instances. That is to say, the selection of each additional substripe is independent from the order of writing and reading the previous substripes of source data. Preferably, the selection of each additional substripe is also independent from the previous determinations of substripe combinations. For example, an additional substripe may be selected for any of the substripe levels of any of the source nodes. This flexibility allows codes with a wide range of properties to be generated so as to realise, for example, an advantageous repair bandwidth.

Particularly preferred determinations of the additional substripes on which one or more substripes of a redundant node are generated in dependence include one or more of the following:

-   At least one additional source data substripe being selectable from any one of the source nodes;
-   At least one additional source data substripe from each substripe index level being included in the redundant nodes, wherein each of the substripes within each node is identifiable by substripe index level, i, where 1≤i≤α and i is an integer;
-   For each substripe index level, there is at least one substripe of a redundant node that is dependent on an additional substripe of source data with a different substripe index level from the substripe of said at least one substripe of the redundant node;
-   At least one additional source data substripe being selectable from any one of the substripe index levels;
-   for all but one of redundant nodes, the redundant nodes to be determined in dependence on at least one additional substripe of source data;
-   selecting at least one substripe from each of the source data nodes as further substripes of source data that one or more of the redundant nodes are further dependent on;
-   selecting at least one substripe from each of the source data nodes as further substripes of source data that one of the redundant nodes is further dependent on; and
-   for each of two or more of the redundant nodes, selecting at least one substripe from each of the source data nodes as further substripes of source data that the redundant node is further dependent on.

Accordingly, embodiments provide a systematic erasure coding technique with a flexible sub-packetisation level that improves on RS coding. Improvements are made by determining each of one or more of the substripes of at least one other redundant node than the first redundant node to be further dependent on at least one further substripe of source data that it is not currently dependent on.

FIG. 4 is a flowchart of a method for determining how to encode data in accordance with a systematic coding technique according to an embodiment.

In step 401, the process begins.

In step 403, the code parameters n, k, r and α are determined, wherein n is the total number of nodes, k is the total number of source data nodes, r is the total number of redundant data nodes, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprises the same number of substripes, and wherein α satisfies the condition 1<α≤r^(m), where

$m = {\left\lceil \frac{k}{r} \right\rceil.}$
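The parameter conditions of step 403 can be expressed as a small check. This is a sketch only; the function name `check_parameters` is hypothetical.

```python
def check_parameters(n, k, r, alpha):
    """Validate the step 403 conditions and return m = ceil(k / r)."""
    if n != k + r:
        raise ValueError("n must equal k + r")
    m = -(-k // r)                      # integer ceiling of k / r
    if not 1 < alpha <= r ** m:
        raise ValueError(f"alpha must satisfy 1 < alpha <= r^m = {r ** m}")
    return m


m = check_parameters(n=14, k=10, r=4, alpha=8)     # m = 3, r^m = 64
```

For the (14, 10) example any α from 2 to 64 passes the check, which reflects the flexible sub-packetisation level described above.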

In step 405, source data nodes are determined that comprise source data that is not encoded by the systematic erasure coding technique.

In step 407, it is determined to generate, for each of the r redundant nodes, each of the α substripes of data in dependence on a combination of a different substripe from each of the k source data nodes such that each of the α substripes is generated in dependence on a combination of k substripes of source data and the α substripes of the redundant node are generated in dependence on all of the α×k substripes of source data.

In step 409, each of one or more of the α substripes of at least one of the r redundant nodes is determined to be further dependent on at least one further substripe of source data that it is not currently dependent on.

In step 411, the process ends.
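Steps 407 and 409 can be sketched with index arrays in which row i of a redundant node lists the source substripes (i, j) that it depends on. The sketch assumes the same-index-level base combination, and the helper names and the particular extra pair added are hypothetical.

```python
def base_index_array(alpha, k):
    """Step 407: row i depends on substripe i of each of the k source nodes."""
    return [[(i, j) for j in range(1, k + 1)] for i in range(1, alpha + 1)]


def add_dependency(index_array, row, pair):
    """Step 409: make redundant substripe `row` (1-based) further dependent
    on source substripe `pair` = (i, j).

    A pair that the substripe already depends on is rejected.
    """
    dependencies = index_array[row - 1]
    if pair in dependencies:
        raise ValueError(f"substripe {row} already depends on {pair}")
    dependencies.append(pair)


alpha, k, r = 8, 10, 4
arrays = [base_index_array(alpha, k) for _ in range(r)]

# Hypothetical choice: substripe 1 of the second redundant node also
# depends on substripe 2 of source node 1.
add_dependency(arrays[1], row=1, pair=(2, 1))
```

The first redundant node keeps exactly k dependencies per substripe, while any other redundant node may accumulate further pairs in this way.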

Embodiments of the invention also include a number of modifications and variations to the embodiments as described above.

Preferably, each coding of source and redundant nodes that are generated according to embodiments is tested to determine that it has the property of being MDS. This preferably is part of an iterative process with each iteration changing one or more of the coefficients of one of the additional substripes of source data that a redundant node is made to be dependent on, the actual substripe of source data or the substripe of the redundant node that an additional substripe is added to. The iterative process would be stopped as soon as the generator matrix was determined to be MDS. Advantageously, the coding of source and redundant nodes according to this embodiment is always MDS.

An advantageous property of the coding of source and redundant nodes according to embodiments is that, when the coding is expressed as a generator matrix, if all of the columns of the generator matrix are linearly independent, the generator matrix defines an MDS coding technique. The process of determining if a code is MDS is therefore straightforward for a skilled person to implement.
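One way such a test might be implemented is sketched below. It uses the standard formulation in which the columns belonging to every choice of k nodes must be linearly independent, which is an interpretation rather than a procedure stated in the text. GF(256) with polynomial 0x11D, the function names and the small example matrices are all assumptions.

```python
from itertools import combinations


def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(256); the reduction polynomial is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Inverse of a non-zero element, computed as a^254."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def gf_rank(rows):
    """Rank of a matrix over GF(256) by Gaussian elimination."""
    m = [list(row) for row in rows]
    rank = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(rank, len(m)) if m[i][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = gf_inv(m[rank][col])
        m[rank] = [gf_mul(inv, x) for x in m[rank]]
        for i in range(len(m)):
            if i != rank and m[i][col]:
                factor = m[i][col]
                m[i] = [x ^ gf_mul(factor, y) for x, y in zip(m[i], m[rank])]
        rank += 1
    return rank


def is_mds(generator, n, k, alpha):
    """generator has k*alpha rows and n*alpha columns; node t owns columns
    t*alpha .. (t+1)*alpha - 1. True if any k nodes determine the source data."""
    for nodes in combinations(range(n), k):
        cols = [t * alpha + s for t in nodes for s in range(alpha)]
        sub = [[row[c] for c in cols] for row in generator]
        if gf_rank(sub) != k * alpha:
            return False
    return True


# Tiny illustrative (4, 2) codes with alpha = 1 (not codes from the text).
G_good = [[1, 0, 1, 1],
          [0, 1, 1, 2]]
G_bad = [[1, 0, 1, 1],
         [0, 1, 1, 1]]
print(is_mds(G_good, n=4, k=2, alpha=1), is_mds(G_bad, n=4, k=2, alpha=1))
```

In the iterative process described above, a routine of this kind would be called after each change of coefficient or substripe selection, stopping at the first generator matrix that passes.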

TABLE 1

Code (n, k)    k     r
(6, 3)         3     3
(9, 6)         6     3
(12, 9)        9     3
(14, 11)       11    3
(15, 12)       12    3
(16, 13)       13    3
(17, 14)       14    3
(18, 14)       14    3
(19, 16)       16    3
(20, 17)       17    3
(23, 20)       20    3
(24, 21)       21    3
(8, 4)         4     4
(9, 5)         5     4
(10, 6)        6     4
(12, 8)        8     4
(14, 10)       10    4
(15, 11)       11    4
(16, 12)       12    4
(18, 14)       14    4
(19, 15)       15    4
(20, 16)       16    4
(22, 18)       18    4
(23, 19)       19    4
(24, 20)       20    4
(25, 21)       21    4
(26, 22)       22    4
(28, 24)       24    4
(30, 26)       26    4
(32, 28)       28    4
(34, 30)       30    4
(10, 5)        5     5
(12, 7)        7     5
(15, 10)       10    5
(20, 15)       15    5
(25, 20)       20    5
(30, 25)       25    5
(35, 30)       30    5
(40, 35)       35    5
(45, 40)       40    5
(12, 6)        6     6
(15, 9)        9     6
(18, 12)       12    6
(19, 13)       13    6
(20, 14)       14    6
(26, 20)       20    6
(30, 24)       24    6
(31, 25)       25    6
(46, 40)       40    6
(14, 7)        7     7
(27, 20)       20    7
(16, 8)        8     8
(20, 12)       12    8
(28, 20)       20    8
(32, 24)       24    8
(48, 40)       40    8

Example values for k and r for implementation in embodiments are shown in Table 1. The codes shown in Table 1 cover many common disk array sizes. Particularly preferable are codes with k a lot greater than r. These are the most efficient as only one source node reconstruction at a particular time is likely to be required. The underlying finite field may be GF(256). Embodiments also include using more complex GF(256) operations and this would allow larger configurations to be realised. The codes according to embodiments can be constructed over any finite field GF(2^(n)) so long as n is large enough for an MDS erasure code to be generated. A typical implementation would use GF(16) or GF(256).

Embodiments are particularly appropriate for generating data for storage in nodes of a data storage system. In a typical implementation of such a system the number of redundant data nodes is a lot fewer than the number of source data nodes. In a preferred implementation of an embodiment in a data storage system, the number of substripes in each node is less than or equal to r^(m), where m=┌k/r┐.

A particularly advantageous property of the codes according to embodiments is that there is a lot of flexibility when generating codes with the same parameters of k, r, α. That is to say, a large number of different codes can be generated with all of the codes being designed for systems with the same number of source nodes, the same number of redundant nodes and the same sub-packetisation level. However, the generated codes will differ in their properties and by selecting codes for use that have the most appropriate properties for a particular application the used codes can be optimised for the particular application. Examples of properties that the codes can be optimised with respect to are repair bandwidth, the number of Input/Output operations (i.e. the number of reads and writes) and the latency of the repair. The code optimisation with respect to a particular property may be performed as part of a stand-alone iterative process for optimising MDS codes, or the code optimisation may be performed within the above-described iterative process for determining if a generator matrix is for an MDS code.

The actual generation of coded data in dependence on source data according to embodiments can be performed with known techniques and using known hardware. The processes required to use the techniques according to embodiments to generate a plurality of source nodes and redundant nodes in a data storage system/network of a data centre storing the coded data would be a straightforward task for the skilled person. The skilled person would also be able to use known hardware to reconstruct one or more source nodes in order to implement embodiments.

The nodes according to embodiments include single data disks, or drives, or groups of data disks, or drives. A node includes any form of data storage element, a part of such an element or multiple such elements. In particular, a node can be any logical entity where data can be stored, and can be anything from a whole, a group of or parts of physical storage devices or locations including but not limited to memory based storage devices such as RAM and SSDs, hard drives, tape storage, optical storage devices, servers and data centers. The method according to embodiments may be performed within a single SSD disk. The method according to embodiments may be performed between chips inside an SSD, or between banks inside (flash) chips.

The coding technique according to embodiments can therefore be used for each file/object/entity of data being stored on storage nodes/storage mediums. For example, say that the storage nodes comprise 14 hard drives and the encoding scheme has 10 data nodes and four redundancy nodes. The coding technique of embodiments can be applied to each file being stored on the hard drives so that each file is split into 10 segments and four redundancy segments of data are generated for each file. The encoding pattern is therefore found multiple times in the stored data (as many times as there are files being stored) on the storage nodes and not necessarily once for all of the storage nodes. The pattern is therefore repeated for each file.
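The per-file splitting described above might look like the following sketch, which lays a file out as the α×k array of source substripes. Zero padding, the contiguous layout of each segment and the function name are assumptions made for illustration.

```python
def split_into_substripes(payload, k, alpha):
    """Return alpha rows x k columns of equal-sized byte strings.

    Column j holds the segment for source node j, cut into alpha substripes.
    The payload is padded with zero bytes to fill the array (an assumption).
    """
    cells = k * alpha
    size = -(-len(payload) // cells) or 1          # bytes per substripe
    padded = payload.ljust(cells * size, b"\0")
    data = [[b""] * k for _ in range(alpha)]
    for j in range(k):
        for i in range(alpha):
            start = (j * alpha + i) * size
            data[i][j] = padded[start:start + size]
    return data


payload = bytes(range(80))                          # an 80-symbol "file"
data = split_into_substripes(payload, k=10, alpha=8)
```

Each file stored in the system would be split in this way before its four redundancy segments are computed, so the same layout repeats once per file.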

The storage of the data in a data storage system is not limited to the data storage system having nodes, i.e. data drives or sections of a data drive, that are only for use as a store of source data or redundant data. The systematic property may be maintained but a mapping introduced so that a data drive may store redundant data within a source data node and vice-versa. This interleaving of data changes the mapping of coded data to stored data and can be used to control the read operations from a data storage system, for example to ensure that the network traffic is balanced across the data storage system.
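One possible mapping of this kind is a simple rotation of node roles across drives from one stripe to the next. The sketch below is a hypothetical illustration only; the text does not prescribe a particular mapping, and the function name is an assumption.

```python
def drive_for(node, stripe, n):
    """Drive that stores logical node `node` (0-based; nodes k..n-1 are
    redundant) for the given stripe, rotating by one drive per stripe."""
    return (node + stripe) % n


n, k = 14, 10
for stripe in range(3):
    parity_drives = sorted(drive_for(node, stripe, n) for node in range(k, n))
    print(f"stripe {stripe}: redundant data on drives {parity_drives}")
```

Over n consecutive stripes every drive holds redundant data the same number of times, so reads of source data are spread evenly across the drives.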

Although data storage is a particularly preferable application for the coding techniques disclosed herein, embodiments include the generation of codes for any application, such as data transmission. For example, the nodes may correspond to data packets for transmission over a network.

The systematic coding techniques according to embodiments include erasure resilient systematic coding techniques.

The flow charts and descriptions thereof herein should not be understood to prescribe a fixed order of performing the method steps described therein. Rather, the method steps may be performed in any order that is practicable. Although the present invention has been described in connection with specific exemplary embodiments, it should be understood that various changes, substitutions, and alterations apparent to those skilled in the art can be made to the disclosed embodiments without departing from the spirit and scope of the invention as set forth in the appended claims.

Methods and processes described herein can be embodied as code (e.g., software code) and/or data. Such code and data can be stored on one or more computer-readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium. In certain embodiments, one or more of the steps of the methods and processes described herein can be performed by a processor (e.g., a processor of a computer system or data storage system). It should be appreciated by those skilled in the art that computer-readable media include removable and non-removable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that are capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any propagating signals.

1-48. (canceled)
 49. A method for determining how to encode data in accordance with a systematic coding technique and encoding data in accordance with the determined systematic coding technique, the method comprising: determining the code parameters n, k, r and α, wherein n is the total number of nodes, k is the total number of source data nodes, r is the total number of redundant nodes, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant nodes comprise the same number of substripes, and wherein α is determined so that it satisfies either the condition 1<α<r^(m) or both of the conditions α=r^(m) and (k/r) is not an integer, where m=ceiling(k/r); determining source data nodes that comprise source data that is not encoded by the systematic coding technique; for each of the redundant nodes, determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes such that each of the substripes is generated in dependence on a combination of k substripes of source data and the α substripes of the redundant node are generated in dependence on all of the (α×k) substripes of source data; and determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, wherein said determination comprises selecting a further substripe of source data for a redundant node to be further dependent on with the further substripe being selectable from any one of the k source nodes; and encoding data in accordance with the determined systematic coding technique.
 50. The method according to claim 49, wherein: each of the substripes within each node is identifiable by substripe index level, i, where 1≤i≤α and i is an integer; for at least one of the redundant nodes, said determining to generate each of the substripes of data in dependence on a combination of a different substripe from each of the source data nodes comprises determining, for each substripe of said at least one of the redundant nodes, the substripe of the redundant node to be a combination of a single substripe from all of the source data nodes with the substripe of the redundant node and the substripes from each source data nodes all having the same substripe index level; in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting a further substripe of source data for a redundant node to be further dependent on with the further substripe being selectable from any one of the α substripe index levels; and, preferably, the determination comprises selecting substripes from the source data nodes as further substripes of source data that one or more of the redundant nodes are further dependent on with the selection comprising at least one substripe from each of the α substripe index levels; wherein, for each substripe index level, there is at least one substripe of a redundant node that is dependent on an additional substripe of source data with a different substripe index level from the substripe of said at least one substripe of the redundant node; wherein in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises, for all but one of redundant nodes, the redundant nodes to be determined in dependence on at least one additional substripe of source data; wherein in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting at least one substripe from each of the k source data nodes as further substripes of source data that one or more of the redundant nodes are further dependent on; wherein in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises selecting at least one substripe from each of the k source data nodes as further substripes of source data that one of the redundant nodes is further dependent on; wherein in said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on, the determination comprises, for each of two or more of the redundant nodes, selecting at least one substripe from each of the k source data nodes as further substripes of source data that the redundant node is further dependent on; wherein (k/r) is not an integer; wherein either r is 2 or more, or r is 3 or more; wherein either α is 2 or more, or α is 3 or more; wherein α is determined so that it satisfies the condition 1<α<r^(m) and/or (k/r) is not an integer; wherein one of the redundant nodes is determined to have each of its substripes dependent on exactly k substripes of source data and the substripes of the redundant node are generated in dependence on all of the (α×k) substripes of source data; wherein said determining of each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed for r−1 redundant nodes; wherein determining each of one or more of the substripes of a redundant node to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed for all of the substripes of the node; wherein the method further comprises performing a balanced selection of the substripes that the redundant nodes are further dependent on such that substantially the same number of read operations are required to recover each source node; wherein the method is computer-implemented; wherein the determined systematic coding technique is MDS; wherein the combining of substripes of source data to generate a substripe of a redundant node is by linear combinations over finite fields; and wherein the systematic coding technique is an erasure resilient systematic coding technique.
 51. The method according to claim 49, wherein (k/r) is an integer.
 52. The method according to claim 49, wherein said step of determining each of one or more of the substripes of at least one of the redundant nodes to be further dependent on at least one further substripe of source data that it is not currently dependent on is performed in accordance with any of a random determination, a pseudo-random determination, a pre-determined technique and/or an algorithm; wherein, when determining each of one or more of the substripes of a redundant node to be further dependent on at least one further substripe of source data that it is not currently dependent on, the selection of each further substripe of source data is independent from the order of writing and reading the previous substripes of source data; wherein: said step of determining source data nodes comprises determining k source data nodes {d₁, d₂, . . . , d_(k)} where each data node d_(j) comprises an indexed set of α substripes {a_(1,j), a_(2,j), . . . , a_(α,j)} as a two-dimensional array Data with α rows and k columns such that $\mathrm{Data} = \begin{bmatrix} a_{1,1} & a_{1,2} & \ldots & a_{1,k} \\ a_{2,1} & a_{2,2} & \ldots & a_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{\alpha,1} & a_{\alpha,2} & \ldots & a_{\alpha,k} \end{bmatrix};$ and the generation of the redundant nodes comprises: determining r redundant data nodes {p₁, p₂, . . . , p_(r)} where each redundant node p_(l), where 1≤l≤r, comprises an indexed set of α substripes {p_(1,l), p_(2,l), . . . , p_(α,l)}; determining r two-dimensional index arrays P₁, . . . , P_(r); determining the index array for P₁ to have α rows and k columns, where each cell in P₁ is a pair of indexes with the following values: $P_{1} = \begin{bmatrix} (1,1) & (1,2) & \ldots & (1,k) \\ (2,1) & (2,2) & \ldots & (2,k) \\ \vdots & \vdots & \ddots & \vdots \\ (\alpha,1) & (\alpha,2) & \ldots & (\alpha,k) \end{bmatrix};$ determining the index arrays P₂, . . . , P_(r) to have α rows and k+m columns, and where each cell in P_(l), where 2≤l≤r, is a pair of indexes with the following values: $P_{l} = \begin{bmatrix} (1,1) & (1,2) & \ldots & (1,k) & (?,?) & \ldots & (?,?) \\ (2,1) & (2,2) & \ldots & (2,k) & (?,?) & \ldots & (?,?) \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ (\alpha,1) & (\alpha,2) & \ldots & (\alpha,k) & (?,?) & \ldots & (?,?) \end{bmatrix},$ where the pairs with values (?,?) are further determined according to an algorithm; and for each of the redundant nodes {p₁, p₂, . . . , p_(r)}, determining to generate each of the substripes p_(i,l) where 1≤i≤α and 1≤l≤r, in dependence on a combination of different source data substripes a_(j₁,j₂), where the pair (j₁,j₂) is present in the i-th row of the index array P_(l).
 53. A method for determining how to generate a systematic code and encoding data in accordance with the determined systematic code, the method comprising: receiving the code parameters n, α, k and/or r; configuring an algorithm with the received code parameters; determining, by the algorithm, how to generate a systematic code according to the method of claim 49; and encoding data in accordance with the determined systematic code; wherein n, k, α are inputs to the algorithm and index arrays P₁, . . . , P_(r) are outputs that define how to generate each of the redundant nodes, and wherein the algorithm performs the steps of: initialising P₁, . . . , P_(r) as arrays P=((i,j))_(α×k); appending additional m=┌k/r┐ columns to P₂, . . . , P_(r) all initialized to (0,0); setting portion←┌α/r┐; setting ValidPartitions←∅; setting j←0; repeating the steps of: setting j←j+1; setting v←┌j/r┐; setting run←┌α/r^(v)┐; setting step←┌α/r┐−run; determining D_(d_(j))=ValidPartitioning(ValidPartitions,k,r,portion,run,step,J_(v)); setting ValidPartitions=ValidPartitions∪D_(d_(j)); and determining one D_(ρ,d_(j))∈D_(d_(j)), such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0,0), wherein the indexes in D_(ρ,d_(j)) are the row positions where the pairs (i,j) with indexes i∈D\D_(ρ,d_(j)) are assigned in the (k+v)-th column of P₂, . . . , P_(r); until (run>1) AND (j≠0 mod r); while j<k, performing the steps of: setting j←j+1; setting v←┌j/r┐; setting run←0; determining D_(d_(j))=ValidPartitioning(ValidPartitions,k,r,portion,run,step,J_(v)); setting ValidPartitions=ValidPartitions∪D_(d_(j)); determining one D_(ρ,d_(j))∈D_(d_(j)) such that its elements correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r) that are all zero pairs (0, 0), wherein the indexes in D_(ρ,d_(j)) are the row positions where the elements (i,j) with indexes i∈D\D_(ρ,d_(j)) are assigned in the (k+v)-th column of P₂, . . . , P_(r); and when the condition j<k is no longer satisfied, outputting the determined P₁, . . . , P_(r); wherein, the steps of the algorithm further comprise: partitioning the set Nodes={d₁, . . . , d_(k)} of k data disks in ┌k/r┐ disjunctive subsets J₁, . . . , J_(┌k/r┐) where |J_(v)|=r and where if r does not divide k then the last subset J_(┌k/r┐) has k mod r elements and where Nodes=∪_(v=1)^(┌k/r┐) J_(v); wherein the function ValidPartitioning is called by the algorithm and takes ValidPartitions,k,r,portion,run,step,J_(v) as inputs and outputs D_(d_(j))={D_(1,d_(j)), . . . , D_(r,d_(j))}; and the ValidPartitioning function comprises the steps of: setting D={1, 2, . . . , α}; if run≠0 then finding D_(d_(j)) that satisfies Condition 1 and Condition 2; else finding D_(d_(j)) that satisfies Condition 2; where Condition 1 is that at least one subset D_(ρ,d_(j)) has portion elements with runs of run consecutive elements separated with a distance between the indexes equal to step, wherein the elements of that subset correspond to row indexes in the (k+v)-th column in one of the arrays P₂, . . . , P_(r), that are all zero pairs (0, 0), and the distance between two elements in one node is computed in a cyclical manner such that the distance between the elements a_(α−1) and a₂ is 2; and Condition 2 is that a necessary condition for the valid partitioning of the elements in the systematic nodes to achieve the lowest possible repair bandwidth is D_(d_(j₁))=D_(d_(j₂)) for all d_(j₁) and d_(j₂) in J_(v) and D_(d_(j₁))≠D_(d_(j₂)) for all d_(j₁) and d_(j₂) systematic nodes in the system, and if portion divides α, then D_(ρ,d_(j)) for all d_(j) in the J_(v)-th subset are disjunctive, i.e., D=∪_(j=1)^(r) D_(ρ,d_(j))={1, 2, . . . , α}; wherein the combining of substripes of source data to generate a substripe p_(i,l) of the redundant node p_(l) is by linear combinations over finite fields according to the index arrays P₁, . . . , P_(r); and wherein p_(i,l)=Σc_(l,j₁,j₂)a_(j₁,j₂), where 1≤i≤α, 1≤l≤r and the pair (j₁,j₂) exists in the i-th row of the index array P_(l) and co-efficient c_(l,j₁,j₂) is a non-zero element in the finite field.
54. A method for storing data in a data storage system, wherein n is the total number of data storage nodes of the data storage system, k is the total number of source data nodes of the data storage system, r is the total number of redundant data nodes of the data storage system, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprise the same number of substripes, and wherein α satisfies either the condition 1<α<r^(m) or both of the conditions α=r^(m) and (k/r) is not an integer, where m=ceiling(k/r), the method comprising: determining how to encode the source data for storing in the data storage system in dependence on the systematic coding technique according to claim 49; determining the redundant data by encoding the source data in accordance with the determined systematic coding technique; and storing the source data and the redundant data in the data storage nodes of the data storage system.
55. The method according to claim 54, further comprising performing a mapping operation on the source data and encoded source data such that one or more of the data storage nodes stores both source data and encoded source data.
56. A method of coding source data, the method comprising: obtaining source data; and encoding the source data in accordance with a systematic coding technique determined according to claim 49; the method further comprising transferring the source data and redundant data over a network.
57. A computing system configured to perform the method according to claim 49.
58. A computer program that, when executed by a computing system, causes the computing system to perform the method according to claim 49.
59. A data storage system, wherein n is the total number of data storage nodes of the data storage system, k is the total number of source data nodes of the data storage system, r is the total number of redundant data nodes of the data storage system, such that n=k+r, α is the number of substripes of data in one of the nodes and each of the source data nodes and redundant data nodes comprise the same number of substripes, and wherein α satisfies either the condition 1<α<r^(m) or both of the conditions α=r^(m) and (k/r) is not an integer, where m=ceiling(k/r), the data storage system configured to store data in accordance with the method of claim 54.
60. A method of recovering a node that is one of a plurality of systematically coded source and redundant nodes, the method comprising applying a decoding method that is the inverse of the method according to claim 49.
61. A method of recovering one of a plurality of nodes, wherein the plurality of nodes have been coded according to the method of claim 49, the method comprising: obtaining a set of $\left\lceil \frac{\alpha}{r} \right\rceil$ substripes of data from nodes of the plurality of nodes other than the node being recovered; obtaining one or more further substripes of data from nodes of the plurality of nodes other than the node being recovered; using the obtained set of $\left\lceil \frac{\alpha}{r} \right\rceil$ substripes to recover one or more substripes of the node being recovered; and recovering all of the other substripes of the node being recovered in dependence on the one or more further substripes and a re-use of the obtained set of $\left\lceil \frac{\alpha}{r} \right\rceil$ substripes.
62. A method for recovering one of a plurality of nodes that have been coded according to the method of claim 49, the method comprising: receiving data defining how the plurality of nodes were coded; configuring an algorithm with the received data; and determining, by the algorithm, how to recover the node in dependence on data in the other of the plurality of nodes; wherein the algorithm recovers a node, d_(l), by performing the following steps: accessing and transferring $(k-1)\left\lceil \frac{\alpha}{r} \right\rceil$ elements a_(i,j) from all k−1 non-failed systematic nodes and $\left\lceil \frac{\alpha}{r} \right\rceil$ elements p_(i,1) from p₁ where i∈D_(ρ,d_(l)); repairing a_(i,l) where i∈D_(ρ,d_(l)); accessing and transferring $(r-1)\left\lceil \frac{\alpha}{r} \right\rceil$ elements p_(i,j) from p₂, . . . , p_(r) where i∈D_(ρ,d_(j)); accessing and transferring from the systematic nodes the elements a_(i,j) indexed in the i-th row of the index arrays P₂, . . . , P_(r) where i∈D_(ρ,d_(j)) that are different from said accessed and transferred $(k-1)\left\lceil \frac{\alpha}{r} \right\rceil$ elements a_(i,j) from all k−1 non-failed systematic nodes and $\left\lceil \frac{\alpha}{r} \right\rceil$ elements p_(i,1) from p₁; and repairing a_(i,l) where i∈D\D_(ρ,d_(l)); and wherein: p₁, . . . , p_(r) are redundant data nodes with respective index arrays P₁, . . . , P_(r).
63. The method according to claim 62, wherein the nodes are nodes of a data storage system.
64. A method of reading data from a data storage system, the method comprising reading data in dependence on the method according to claim 62.
65. A method of recovering up to r failed nodes from a plurality of systematically coded source and redundant nodes, the method being equivalent to decoding data in dependence on the inverse of the generator matrix for implementing the coding technique according to claim 49.
66. The method according to claim 60, wherein the method is computer-implemented.
67. A computing system configured to perform the method according to claim 62.
68. A computer program that, when executed by a computing system, causes the computing system to perform the method according to claim 62.
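By way of non-limiting illustration only, the condition on α recited in claims 54 and 59 and the explicitly stated transfer counts recited in claim 62 can be sketched in Python as follows. The function names ceil_div, alpha_is_admissible and fixed_repair_transfers are hypothetical and do not appear in the claims. The sketch counts only the transfers whose size is stated in claim 62; it does not count the further elements a_(i,j) read from the systematic nodes, whose number depends on the index arrays P₂, . . . , P_(r).

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers (hypothetical helper)."""
    return -(-a // b)


def alpha_is_admissible(k: int, r: int, alpha: int) -> bool:
    """Check the condition on alpha recited in claims 54 and 59.

    Either 1 < alpha < r^m, or alpha == r^m with k/r not an integer,
    where m = ceiling(k/r).
    """
    m = ceil_div(k, r)
    upper = r ** m
    return (1 < alpha < upper) or (alpha == upper and k % r != 0)


def fixed_repair_transfers(k: int, r: int, alpha: int) -> dict:
    """Count the element transfers whose size claim 62 states explicitly.

    Assumption: the additional a_(i,j) elements indexed by P_2, ..., P_r
    are NOT included, because their number depends on the index arrays.
    """
    portion = ceil_div(alpha, r)  # ceil(alpha / r)
    return {
        "from_surviving_systematic_nodes": (k - 1) * portion,
        "from_p1": portion,
        "from_p2_to_pr": (r - 1) * portion,
    }
```

Under these assumptions, with k=10, r=4 and α=16, m=3 and r^(m)=64, so α is admissible; ⌈α/r⌉=4, giving 36 elements from the nine surviving systematic nodes, 4 elements from p₁ and 12 elements from p₂, . . . , p_(r).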